1. Introduction

We introduce and analyze a new dataset that resembles the
input to biological vision systems much more closely than
most previously published ones. Our analysis led to several
important conclusions. First, it is possible to disambiguate
dozens of visual scenes (locations) encountered over the
course of several weeks of a human life with an accuracy of
over 80%. This opens up the possibility of numerous novel
vision applications, from early detection of dementia to
everyday use of wearable camera streams for automatic
reminders and visual stream exchange. Second, our
experimental results indicate that generative models, such
as Latent Dirichlet Allocation [4] or Counting Grids [1],
are more suitable for this type of data: they are more
robust to overtraining and cope better with images that are
low-resolution, blurred, and characterized by relatively
random clutter and a mix of objects.

2. Data Acquisition 

To gather the data, a subject wore a SenseCam during all
waking hours for three weeks. The camera was rarely turned
off, except during potentially sensitive moments. SenseCam
snapshots are triggered automatically by sudden changes in
the visual field, or by default every 45 s. On average, a
snapshot was taken roughly every 20 s. This translated into
~2k images a day at a resolution of 640 × 480, for a total
of 43522 images.
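As a quick consistency check on these figures, the short sketch below recomputes the daily image count and the implied recording time. It assumes that "three weeks" means 21 full days, which the text does not state explicitly.

```python
# Figures quoted in the text.
total_images = 43522
mean_interval_s = 20      # average time between snapshots

# Assumption: "three weeks" is taken as 21 full days of recording.
days = 21

images_per_day = total_images / days
hours_recorded_per_day = images_per_day * mean_interval_s / 3600

print(f"{images_per_day:.0f} images/day, "
      f"about {hours_recorded_per_day:.1f} hours of recording per day")
```

Under that assumption the daily count agrees with the "~2k images a day" quoted above.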

We selected a random 10% of the data and found that the
recurrent types of scenes fell into 32 classes. About 15%
of the images in this random selection belong to spurious
types of scenes with only one or two examples.
With the help of the original subject, we started from a
few examples of each of the 32 recurring scenes and then
manually labeled the rest of the selected frames. For some
classes, this procedure did not yield enough images for
proper training and testing; for these classes we went back
to the whole dataset and extracted more examples for both
testing and training. This process yielded a total of 3959
labeled images. Some images for each class are shown in
Fig. 1.

During acquisition, each image is time-tagged. This lets us
show in Fig. 2a the number of different days on which each
class was seen, and in Fig. 2b the distribution over time
of day (morning/afternoon/evening/night) for each class.

We also labeled all the images from two whole days to test
whether timestamp information can help recognition. The two
days contain 2043 and 1703 images, respectively.

To download the dataset, please send an email to the
author.

3. Experiments 

In the following experiments, we used either SIFT features
[6], extracted from 16x16 patches spaced 8 pixels apart and
clustered into Z=200 visual words, or gist descriptors
[10], extracted at 4 scales with 8 orientations per scale.
We compared the performance of discriminative, generative
and epitome-like methods. As discriminative methods we
employed a Support Vector Machine with a linear kernel on
gist [8] and on quantized SIFT histograms, and the Spatial
Pyramid Kernel (SPK, 3 levels) [5]. As generative models we
considered Fei-Fei's semi-supervised Latent Dirichlet
Allocation (LDA) [4], Counting Grids (CG) [1], and a
mixture of Dirichlet distributions over the quantized SIFT
histograms. The last approach is similar in spirit to [7],
where a mixture of Gaussians over gist descriptors is
learnt for each class. Finally, we also tried epitome-like
approaches: Structural Epitome [3], Epitomic Location
Recognition [9] and FFT's Counting Grids [2] (a hybrid
between epitomes and Counting Grids).
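
To make the bag-of-words representation concrete, the sketch below lays out a dense grid of 16x16 patches spaced 8 pixels apart on a 640 × 480 image and quantizes one descriptor per patch against a codebook of Z=200 words. It is only an illustration under stated assumptions: the border handling (patches must fit entirely inside the image), the random stand-in descriptors and the random codebook are not taken from the paper, and the function names are hypothetical.

```python
import numpy as np

def dense_grid(width, height, patch=16, step=8):
    """Top-left corners of all patch x patch windows spaced `step` pixels apart.

    Assumption: only windows that fit entirely inside the image are kept.
    """
    xs = np.arange(0, width - patch + 1, step)
    ys = np.arange(0, height - patch + 1, step)
    return [(int(x), int(y)) for y in ys for x in xs]

def bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword and count assignments."""
    # Squared Euclidean distances via |a|^2 - 2ab + |b|^2, shape (N, Z).
    d2 = ((descriptors ** 2).sum(axis=1, keepdims=True)
          - 2.0 * descriptors @ codebook.T
          + (codebook ** 2).sum(axis=1))
    words = d2.argmin(axis=1)
    return np.bincount(words, minlength=len(codebook))

corners = dense_grid(640, 480)

# Random stand-ins: real descriptors would be 128-dimensional SIFT vectors,
# and the codebook would come from clustering training descriptors.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(200, 128))
descriptors = rng.normal(size=(len(corners), 128))

hist = bow_histogram(descriptors, codebook)
print(len(corners), hist.shape, hist.sum())
```

Each image is thus summarized by a 200-bin histogram of visual-word counts, which is the input assumed by the histogram-based methods listed above.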

Figure 1. From top left to bottom right, we show three images from each class of SenseCam-32. The 32 classes are: Bathroom Home, Bedroom, Biking, Cafeteria, Car, Classroom, Conference Room, Corridors Work, Dining Room, Bakery, Garage, Atrium, Entry, Hiking Trail, Ice Palace, Kids Bedroom, Game Room, Kitchen, Living Room, Lounge, Home Office, Campus, Parking Work, Patio, Playground, Restroom Work, Small Bathroom Home, Small Home Office, Tennis Court, Food Court, Grocery Store, Work Office.



Figure 2. Statistics on the SenseCam dataset. a) Number of days a location is visited (training set).



3.1. Scene Classification 

In this section we provide some baselines for the
scene/place classification task. Since some categories have
many more images than others, we used 15 images per class
for training and at most 15 images per class for testing.
We repeated the experiments 5 times and averaged the
results. Classification accuracies are shown in Tab. 1,
where we report the best result obtained by each method.
Generative classifiers are built by learning a model for
each class from the training data, and assigning a test
sample to the class that yields the highest likelihood. As
Tab. 1 shows, they reach the best results; moreover, their
performance does not vary much for a "reasonable" choice of
parameter settings (number of topics for LDA, capacity for
Counting Grids). The poor accuracies of discriminative
methods [5, 8] are clearly due to overtraining, but with
this type of dataset we must expect labeled data to be
scarce. Interestingly, the methods based on pixelwise
comparisons fail on this extended dataset, as they cannot
capture geometric transformations well. Being a hybrid
between epitomes and counting grids, [2] reaches decent
results, but it also finds it hard to generalize with so
little labeled data.
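
The generative classification rule described above (one model per class, pick the class with the highest likelihood) can be sketched as follows. To keep the example short, the per-class model here is a smoothed multinomial over visual words. This is a deliberately simple stand-in, not LDA, Counting Grids or the Dirichlet mixture used in the paper, and the toy histograms and function names are hypothetical.

```python
import numpy as np

def fit_class_models(histograms, labels, num_classes, alpha=1.0):
    """Fit one smoothed multinomial over visual words per class.

    Returns an array of shape (num_classes, num_words) whose rows sum to 1.
    `alpha` is an additive smoothing constant (an assumption of this sketch).
    """
    counts = np.full((num_classes, histograms.shape[1]), alpha, dtype=float)
    for h, y in zip(histograms, labels):
        counts[y] += h
    return counts / counts.sum(axis=1, keepdims=True)

def log_likelihoods(histogram, word_probs):
    """Log-likelihood of one histogram under every class model,
    up to a constant that does not depend on the class."""
    return histogram @ np.log(word_probs).T

def classify(histogram, word_probs):
    """Assign the class whose model gives the highest likelihood."""
    return int(np.argmax(log_likelihoods(histogram, word_probs)))

# Toy data: 2 classes, 4 visual words.
train_h = np.array([[8, 1, 1, 0], [7, 2, 0, 1],    # class 0
                    [0, 1, 8, 1], [1, 0, 7, 2]])   # class 1
train_y = np.array([0, 0, 1, 1])
models = fit_class_models(train_h, train_y, num_classes=2)

pred_a = classify(np.array([5, 1, 1, 0]), models)
pred_b = classify(np.array([0, 0, 6, 1]), models)
print(pred_a, pred_b)
```

The same decision rule applies when the per-class model is replaced by a richer one; only `log_likelihoods` would change.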

3.2. Where was I? 

As a second test, we asked how many places we could
correctly guess using i) each of the generative models
considered in the previous section, and ii) an HMM over the
possible classes $c_t$, capturing transition constraints
such as living room - kitchen, or office - corridor -
atrium, etc.

Table 1. Scene Classification Results. We report the best
result obtained by each method.

Discriminative Methods
Gist + SVM [8]     SIFT + SVM            SPK [5]
38.10%             49.94%                52.76%

Generative Methods
LDA [4]            Dirichlet Mixt. [7]   CGs [1]
58.05%             49.19%                54.43%

Epitome-like Methods
Stel Ep. [3]       CG-fft [2]            Feature Ep. [9]
15.02%             39.40%                21.32%

We considered the labeled images from two days. During each
day, the camera bearer visited roughly 20 of the labeled
locations, and the two days share only 12 locations.
Nevertheless, we trained models with all 32 classes, since
a priori we cannot know which locations are visited during
a day. Our goal is to compute the place posterior
probabilities at instant $t$ given all the images observed
so far, $P(c_t = k \mid x_{1:t})$. We used the
forward-backward procedure to compute it recursively; in
formulae:

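$$
\begin{aligned}
P(c_t = k \mid x_{1:t}) &\propto p(x_t \mid c_t = k) \cdot P(c_t = k \mid x_{1:t-1}) \\
&= p(x_t \mid c_t = k) \cdot \sum_{c} P(c_t = k \mid c_{t-1} = c) \cdot P(c_{t-1} = c \mid x_{1:t-1}) \qquad (1)
\end{aligned}
$$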
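
The recursion above can be sketched in a few lines of NumPy. This is a minimal illustration of the filtering step only, assuming the observation log-likelihoods, the transition matrix and the place prior are already given. The toy numbers and the function name are hypothetical, and the sketch omits the backward pass and the EM estimation.

```python
import numpy as np

def forward_filter(log_lik, A, prior):
    """Filtered posteriors P(c_t = k | x_{1:t}) for every frame t.

    log_lik[t, k] = log p(x_t | c_t = k)
    A[k, c]       = P(c_t = k | c_{t-1} = c), each column sums to 1
    prior[k]      = P(c_1 = k)
    """
    T, K = log_lik.shape
    post = np.zeros((T, K))
    predicted = prior                      # P(c_t = k | x_{1:t-1}); the prior at t = 1
    for t in range(T):
        # Subtract the maximum before exponentiating for numerical stability.
        lik = np.exp(log_lik[t] - log_lik[t].max())
        unnorm = lik * predicted
        post[t] = unnorm / unnorm.sum()
        predicted = A @ post[t]            # sum over the previous state c
    return post

# Toy example with two places and "sticky" transitions.
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
prior = np.array([0.5, 0.5])
log_lik = np.log(np.array([[0.8, 0.2],
                           [0.4, 0.6],    # this frame alone favors place 1
                           [0.8, 0.2]]))
post = forward_filter(log_lik, A, prior)
print(post.argmax(axis=1))
```

In the toy example the second frame, taken in isolation, favors the second place, but the filtered posterior still prefers the first one because the transition term carries over the evidence from the previous frame.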



We fixed the HMM's observation log-likelihood (i.e.,
$\log p(x_t \mid c_t = k)$) to the negative free energy of
the generative model at hand, while we used EM to estimate
the transition matrix $A_{k|c} = P(c_t = k \mid c_{t-1} = c)$,
the place prior $\pi_k = P(c_1 = k)$ and the place
posteriors in an unsupervised way, simply fitting the
likelihood to the day's images. We used a strong Dirichlet
prior over self transitions to favor staying in the same
state/location. Finally, since the observation likelihood
terms often dominate the effects of the transition prior,
we adopt the standard solution of re-scaling the likelihood
terms.

This approach is very similar to [7], so that reference
provides a natural point of comparison. Besides a similar
use of the HMM, the idea of [7] is to learn a mixture model
for each class, which is then used to compute the
observation likelihood.
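
The two HMM devices mentioned above, a strong prior on self transitions and re-scaling of the likelihood terms, can be illustrated as follows. The paper gives neither the prior strength nor the scaling factor, so the values `self_pseudo=50.0`, `other_pseudo=1.0` and `scale=0.1` below are purely hypothetical, as are the function names. Adding pseudo-counts to expected transition counts is one common way to realize a Dirichlet prior in the M-step and may differ from what the authors did.

```python
import numpy as np

def smoothed_transition_matrix(expected_counts, self_pseudo=50.0, other_pseudo=1.0):
    """Transition estimate A[k, c] = P(c_t = k | c_{t-1} = c).

    expected_counts[k, c] holds the expected number of c -> k transitions
    (e.g. from an E-step). Dirichlet pseudo-counts are added before
    normalizing, with a larger value on the diagonal to favor staying put.
    """
    K = expected_counts.shape[0]
    pseudo = np.full((K, K), other_pseudo) + np.eye(K) * (self_pseudo - other_pseudo)
    counts = expected_counts + pseudo
    return counts / counts.sum(axis=0, keepdims=True)   # normalize each column c

def rescale_log_likelihood(log_lik, scale=0.1):
    """Shrink the observation log-likelihoods so they do not swamp the
    transition term; `scale` < 1 flattens the per-frame evidence."""
    return scale * log_lik

# With no observed transitions the estimate is driven by the prior alone.
A0 = smoothed_transition_matrix(np.zeros((3, 3)))
scaled = rescale_log_likelihood(np.array([[-100.0, -120.0, -150.0]]))
print(A0[0, 0], scaled)
```

With three states and no data, each column places 50/52 of its mass on the self transition, which shows how strongly such a prior favors remaining in the same location.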

We used at most 30 images per class to learn the models.
Results are reported in Tab. 2; as expected, the
recognition accuracy rises when we "turn on" the HMM
(Eq. 1). For the sake of completeness, we also implemented
the original method of [7], using their descriptors
computed from the whole image and within the four sectors.
Its performance was lower (below 50%).
Table 2. Where was I?

"HMM off" 

Dirichlet Mixt. [7] CGs [1]

62.21%

LDA [4]



54.68% 
"HMM on " 
Dirichlet Mixt. 



66.76% 



M CGs Q 



76.80% 70.37% 81.21% 



4. Conclusion 

In this extended abstract we presented a large dataset
which, unlike others, is entirely natural, as it represents
the visual input of a subject. Using our labels, it
would be easier to analyze the full data collected in H. We 
also showed how temporal information can be used to reach 
compelling accuracies on classification of all the locations 
a subject visits during a day. 